Abstract



In this study, items are drawn from a full-length test of 30 items in order to 
construct shorter tests for the purpose of making accurate pass/fail classifications with 
regard to a specific criterion point on the latent ability metric. A three item-parameter 
IRT framework is used. The criterion point on the latent ability metric corresponds to a 
criterion domain true score (80% correct), established by an expert panel. The shorter 
tests are compared to the full length test in terms of classification accuracy. Number 
correct (NC) scoring is used. We found that the classification accuracy of shorter tests 
meets or even exceeds that of the full length test. In general, a test targeted on a specific 
level of ability can be about half the length of a test designed to classify examinees with 
regard to several (five) levels of ability, without compromising classification accuracy. 
For lower levels of ability, where guessing at difficult items on the test contributes more 
measurement error than information, tests can be shortened even more. These 
conclusions are limited to tests in which pass/fail decisions are based on a number correct 
score. 





In this study, we were interested in constructing shorter versions of a test of 
Applied Mathematics with a view to maintaining or even improving the accuracy of 
pass/fail decisions. The test is used to assign level scores to examinees based on their 
number correct (NC) score. NC scores on the test range from 0 to 30. Level scores 
range from 0 to 5. There are three parallel forms of this test. On the particular form 
that we use in this study, the NC score ranges mapped to level scores 0 through 5 are, 
respectively, [0,11], [12,16], [17,20], [21,25], [26,28], and [29,30]. The mapping of NC 
scores to level scores on the other forms is identical or very close at every level. 
The lowest NC score mapped to a given level score is the "cutoff score" for 
that level. For example, 12 is the NC cutoff score for Level 1. 
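
The mapping from NC scores to level scores described above can be sketched as a lookup on the cutoff scores. This is an illustration only: the function names are hypothetical, and the Level 4 cutoff is taken to be 26 so that the Level 3 and Level 4 ranges do not overlap, which is an assumption about the intended ranges.

```python
from bisect import bisect_right

# NC cutoff scores for Levels 1 through 5 on the form used in this study.
# The Level 4 cutoff (26) is an assumption; see the note above.
CUTOFFS = [12, 17, 21, 26, 29]
MAX_NC = 30


def level_score(nc: int) -> int:
    """Return the level score (0-5) for a number-correct score."""
    if not 0 <= nc <= MAX_NC:
        raise ValueError("NC score must be between 0 and 30")
    # Count how many cutoff scores are at or below the NC score.
    return bisect_right(CUTOFFS, nc)


def at_or_above(nc: int, level: int) -> bool:
    """Pass/fail decision for a single level, as a single-level user would need."""
    return level_score(nc) >= level
```

For example, `at_or_above(20, 3)` is false and `at_or_above(21, 3)` is true, because 21 is the Level 3 cutoff.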

This test is often used in settings where users want to know only whether the 
examinee is at or above a specific level of skill. That is, the users might want to classify 
examinees with regard to being at-or-above Level 3, but are not interested in making 
further distinctions, such as whether an examinee who meets this criterion is higher than 
Level 3. 

This study addresses the very practical task of developing shorter tests, which we 
will call "single-level tests" (SLT), for this purpose. For security reasons, it was decided 
that the one SLT for each level should be constructed by drawing on the items within just 
one of the three available alternate forms. The SLTs would thus collectively expose 
items from only one test form. (There is no item overlap among the available forms.) By 
drawing items from just one form, the construction of a given SLT can be viewed as 
deleting items from that form. Since each SLT is concerned with at-or-above 
classifications with regard to just one level, the SLTs are comparable in their purpose to a 
broad array of tests such as licensure and certification tests and formative mastery tests. 

Our developmental research on the SLTs is of broad interest for at least two 
additional reasons. First, the SLTs are similar to testlets that are used in computer-based 
testing. Options for computer delivery of tests include the use of pre-constructed forms, 
or testlets, each of which contains relatively few items. The items on the pre-constructed 
form(s) might actually be a subset of the items on a given paper-and-pencil test form. 
Dichotomous decisions, such as which testlet to administer next, if not pass/fail 
decisions, might be based on the NC scores on these testlets. 

Second, the criteria for making at-or-above determinations with the SLTs involve 
domain scores. Pass/fail decisions on licensure and certification tests, and on many other 
kinds of tests as well, are typically made with regard to a criterion true score on a domain 
of items. The criterion score may be established by any one of several possible standard 
setting methods such as the modified Angoff method. For the present set of tests, 
including the full-length forms as well as the SLTs, content experts decided that mastery 
of a level should be defined by a criterion true score of 0.8, or 80% correct, or higher on 
the items representing the level. For the present set of tests, each level is represented by 
a pool of eighteen items. 

Methods 

The psychometric framework for mapping NC scores on full length test forms and 
SLTs to level scores is based on work described in Schulz, Kolen, and Nicewander 
(1999). This work is based on the 3-PL IRT model, implemented by BILOG. Items 
from all levels and forms are calibrated to a common scale. The IRT model is used to 
establish a correspondence between the criterion level-domain scores (80% correct) and 
points on the $\theta$ scale. The criteria for mastery of Levels 1 to 5 for the tests in this study 
correspond to $\theta$ values of, respectively, -1.44, -.43, .37, 1.49, and 2.41. These values define the 
lower boundaries of Levels 1 to 5 on the $\theta$ scale. They are denoted $\theta_M$, $M = 1, \ldots, 5$. Level 
0 has no lower boundary. 
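
The correspondence between an 80%-correct domain score and a point on the latent ability scale can be illustrated with a short sketch. It assumes the usual 3-PL response function with a scaling constant D = 1.7 (an assumption; the scaling used in the calibration is not stated here), and the item pool in the usage note is hypothetical, not the study's data.

```python
import math

D = 1.7  # assumed logistic scaling constant


def p3pl(theta, a, b, c):
    """3-PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def domain_score(theta, items):
    """Expected proportion correct on a pool of (a, b, c) items."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items) / len(items)


def criterion_theta(items, target=0.8, lo=-6.0, hi=6.0, tol=1e-8):
    """Find theta where the domain score equals the target, by bisection.

    The domain score increases with theta, so one root exists when the
    target lies between the scores at the two ends of the search interval.
    """
    if not domain_score(lo, items) < target < domain_score(hi, items):
        raise ValueError("target is not bracketed by the search interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if domain_score(mid, items) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical 18-item level pool (illustrative parameters only).
pool = [(1.3, -1.0 + 0.1 * i, 0.18) for i in range(18)]
theta_m = criterion_theta(pool)
```

Here `theta_m` plays the role of a level's lower boundary: an examinee at that ability is expected to answer 80% of the pool correctly.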

The mean ± 1 standard deviation of the item parameter estimates for items on the 
form used to construct the SLTs were 1.34 ± .36 for the a parameter (slope), -0.2 ± 1.4 
for the b parameter (intercept), and .176 ± .064 for the c parameter (lower asymptote or 
'guessing'). Items were ordered on the test approximately by difficulty. Item p-values 
ranged from .216 to .965. Biserial correlations ranged from .129 to .818. Student ability 
in the parameter estimation model ($\theta$) is assumed to have an approximately normal (0,1) 
distribution. 

Construction of shortened tests. All shortened tests were assembled by drawing 
exclusively on the thirty items in the full length test form. For each type of classification 
(see Table 1), tests of length $L$, $L = 4, 5, \ldots, 29$, were constructed by choosing the $L$ items 
that provided the most information (Lord, 1980, p. 21) at the criterion theta ($\theta_M$). The 
full length test corresponds to $L = 30$. The test information function for NC scoring is 
(Lord, 1980, p. 73): 




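$$ I(\theta, X) = \frac{\left[ \sum_{i=1}^{L} P_i' \right]^2}{\sum_{i=1}^{L} P_i Q_i} \tag{1} $$

where $P_i = P_i(\theta)$ is the probability of getting item $i$ correct conditional on $\theta$, $Q_i = 1 - P_i$, and 
$P_i'$ is the first derivative of $P_i$ with respect to $\theta$ (Lord, 1980). 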
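
A minimal sketch of the number-correct information function and of the item selection step follows. It assumes the 3-PL response function with D = 1.7 and ranks items by the usual item information, $P_i'^2 / (P_i Q_i)$, at the criterion theta; both are assumptions about details not spelled out above, and the item parameters are hypothetical.

```python
import numpy as np

D = 1.7  # assumed logistic scaling constant


def p3pl(theta, a, b, c):
    """3-PL probability of a correct response for arrays of item parameters."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))


def dp3pl(theta, a, b, c):
    """First derivative of the 3-PL probability with respect to theta."""
    logistic = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return D * a * (1.0 - c) * logistic * (1.0 - logistic)


def item_information(theta, a, b, c):
    """Item information P'^2 / (P Q), used here only to rank items."""
    p = p3pl(theta, a, b, c)
    return dp3pl(theta, a, b, c) ** 2 / (p * (1.0 - p))


def nc_information(theta, a, b, c):
    """Information of the number-correct score: (sum P')^2 / sum(P Q)."""
    p = p3pl(theta, a, b, c)
    return dp3pl(theta, a, b, c).sum() ** 2 / (p * (1.0 - p)).sum()


def select_items(theta, a, b, c, length):
    """Indices of the `length` items with the most information at theta."""
    order = np.argsort(-item_information(theta, a, b, c))
    return np.sort(order[:length])


# Hypothetical 30-item form (illustrative parameters, not the study's items).
a = np.full(30, 1.3)
b = np.linspace(-2.5, 2.5, 30)
c = np.full(30, 0.18)
short = select_items(-1.44, a, b, c, 15)
info_short = nc_information(-1.44, a[short], b[short], c[short])
info_full = nc_information(-1.44, a, b, c)
```

With these illustrative parameters the 15-item selection carries more number-correct information at the criterion than the full 30-item set, because very hard items add guessing variance to the denominator while adding almost nothing to the numerator.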

We realize that incrementally adding items ordered by the value of their NC 
information function at the criterion $\theta$ to construct longer SLTs does not necessarily 
produce the best $L$-item tests for NC scoring (Hulin, Drasgow, and Parsons, 1983). The 
best $L$-item test does not necessarily contain all items from the best $(L-1)$-item test. The 
main points of our study, however, do not depend on how the tests were constructed. 
Rather, we are concerned with the classification accuracy of shorter tests, however 
constructed, compared with that of the full-length test for specific pass/fail 
classifications. 

Establishing cutoff scores. Let $X$ represent the random number correct score on a 
test. To find the cutoff score for assigning an examinee a level score of $K$ ($K = 0, 1, \ldots, 5$) 
on a test consisting of $L$ items, we found the minimum $X$ that satisfied the following 
equation: 

$$ \sum_{i=1}^{L} P_i(\theta) = X, \quad \theta \ge \theta_M, \quad M = K. \tag{2} $$

For a given $X$, Equation 2 was solved for $\theta$ on the left by the iterative method of half 
intervals. This method provides a first-order approximation to the maximum likelihood 
estimate of $\theta$ from an NC score (Yen, 1984). 
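
The cutoff search in Equation 2 can be sketched as follows: for each candidate NC score, solve the test characteristic curve for theta by half intervals (bisection), and keep the smallest score whose theta is at or above the criterion. The 3-PL form with D = 1.7, the search interval, and the item parameters are assumptions made for illustration.

```python
import math

D = 1.7  # assumed logistic scaling constant


def p3pl(theta, a, b, c):
    """3-PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def expected_nc(theta, items):
    """Expected NC score (test characteristic curve) at theta."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)


def theta_for_score(x, items, lo=-8.0, hi=8.0, tol=1e-10):
    """Solve expected_nc(theta) = x by half intervals; None if not bracketed."""
    if not expected_nc(lo, items) < x < expected_nc(hi, items):
        return None  # e.g. scores below the guessing floor or the perfect score
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_nc(mid, items) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def cutoff_score(items, theta_m):
    """Smallest NC score whose theta solution is at or above theta_m."""
    for x in range(len(items) + 1):
        theta_x = theta_for_score(x, items)
        if theta_x is not None and theta_x >= theta_m:
            return x
    return None


# Hypothetical 12-item test (illustrative parameters only).
items = [(1.3, -2.5 + 0.4 * i, 0.18) for i in range(12)]
cut = cutoff_score(items, -1.44)
```

Because the expected NC score increases with theta, this search returns the expected NC score at the criterion rounded up to the next integer, which is a convenient check on the bisection.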

Estimating classification errors. Let $K = 0, 1, \ldots, 5$ and $\theta$ represent, respectively, the 
assigned level score and the true $\theta$ of a given examinee. Let $P^{+}$ and $P^{-}$ represent the 
predicted proportions of examinees whose classification is too high or too low, given their 
true $\theta$. A pass/fail classification error occurs when $K < M$ and $\theta \ge \theta_M$ (a false negative 
error) and when $K \ge M$ and $\theta < \theta_M$ (a false positive error). The conditional probabilities of 
false positive and false negative errors are defined separately for each level, $M$, as 
(Schulz, Kolen, and Nicewander, 1999): 

$$ P^{+}(M, \theta) = \sum_{k=M}^{5} P[(K = k) \mid \theta], \quad \theta < \theta_M, \quad M = 1, \ldots, 5, \tag{3} $$

and 

$$ P^{-}(M, \theta) = \sum_{k=0}^{M-1} P[(K = k) \mid \theta], \quad \theta \ge \theta_M, \quad M = 1, \ldots, 5. \tag{4} $$
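
For a single pass/fail decision, the conditional error probabilities reduce to the probability that the NC score falls on the wrong side of the cutoff score. A minimal sketch follows. It builds the conditional NC score distribution with the standard recursion over items (a technique assumed here; the text above does not say how the distribution was computed) and assumes the 3-PL form with D = 1.7. All names are hypothetical.

```python
import numpy as np

D = 1.7  # assumed logistic scaling constant


def p3pl(theta, a, b, c):
    """3-PL probability of a correct response for arrays of item parameters."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))


def nc_distribution(p):
    """Distribution of the NC score given per-item success probabilities.

    Items are added one at a time; each step convolves the current score
    distribution with a single Bernoulli outcome.
    """
    dist = np.array([1.0])  # with no items, the score is 0 with certainty
    for p_i in p:
        new = np.zeros(len(dist) + 1)
        new[:-1] += dist * (1.0 - p_i)  # item answered incorrectly
        new[1:] += dist * p_i           # item answered correctly
        dist = new
    return dist


def conditional_error(theta, theta_m, cutoff, a, b, c):
    """Return (error type, probability) for an examinee with ability theta."""
    dist = nc_distribution(p3pl(theta, a, b, c))
    p_pass = dist[cutoff:].sum()  # P(X >= cutoff | theta)
    if theta < theta_m:
        return "false_positive", p_pass
    return "false_negative", 1.0 - p_pass
```

Calling `conditional_error` over a grid of theta values traces out the two conditional error curves for one level.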

Marginal error rates were computed by integrating the conditional error rates over 
a $\theta$ distribution. For each type of classification (see Table 1) we assumed a uniform $\theta$ 
distribution centered at $\theta_M$ and having a range of 3. Integrations were performed by 
quadrature using 31 equally spaced points. 
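
The marginal error rates can be sketched as a weighted sum of conditional error rates over the 31 grid points. Equal weights are assumed here, since the quadrature weights are not stated; `p_pass` is a hypothetical callable returning the probability of passing at a given theta.

```python
import numpy as np


def marginal_error_rates(p_pass, theta_m, width=3.0, n_points=31):
    """Marginal false positive, false negative, and total error rates.

    The theta distribution is uniform on [theta_m - width/2, theta_m + width/2],
    approximated by n_points equally spaced, equally weighted points
    (equal weighting is an assumption).
    """
    # Build offsets from integers so the middle point is exactly theta_m.
    steps = np.arange(n_points) - (n_points - 1) / 2.0
    offsets = steps * width / (n_points - 1)
    thetas = theta_m + offsets
    weight = 1.0 / n_points

    pass_prob = np.array([p_pass(t) for t in thetas])
    below = offsets < 0  # examinees whose true theta is below the criterion

    false_pos = weight * pass_prob[below].sum()
    false_neg = weight * (1.0 - pass_prob[~below]).sum()
    return false_pos, false_neg, false_pos + false_neg
```

A perfectly sharp pass probability (0 below the criterion, 1 at or above it) gives zero error of both kinds, while a coin-flip pass probability gives a total error rate of 0.5.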

Results 

The lower plot of Figure 1 shows that about half of the items on the full length 
test contained practically zero information at a $\theta$ of -1.44 (the Level 1 critical theta). The 
upper plot of Figure 1 shows that the test information for the number correct score, 
conditional on $\theta = -1.44$, peaks at a test length of 15. Adding more items to the test after 
the 15th decreases information. 

Figure 2 shows test information for number correct scoring as a function of $\theta$ for 
two tests: the 16-item test corresponding to one of the points near the peak in Figure 1, 
and the full-length (30-item) test. The 16-item test contains more information than the 
30-item test over a considerable range of $\theta$, from the lower boundary of the $\theta$ 
distribution we assumed for computing marginal error rates (lowest assumed $\theta$) up to 
about -0.4, where the two information curves cross. 

Figure 3 shows the conditional probability of being classified as "at or above 
Level 1" for each test (16 items and 30 items). The probability of passing should be as 
low as possible below the target $\theta$ and as high as possible above the target $\theta$. On this 
basis, the 16-item test performs better than the 30-item test at all levels of $\theta$, including 
levels above -0.4, where the 30-item test information function exceeds that of the 
16-item test (Figure 2). 

Figure 4 shows, as a function of test length, the theta conditional on the optimal 
cutscore, that is, the solution to Equation 2. At first, the cutscore increases one-for-one with 
increasing test length, but later the same cutscore (e.g., 11) applies to a range of test 
lengths. For a cutscore of 11, test lengths range from 21 to 25 items. There is an 
important within-cutscore trend in Figure 4: the theta conditional on a fixed cutscore, 
such as 11, decreases as test length increases. The trend for a given cutscore would 
extend below -1.44 were it not for the rule about choosing cutscores. (This rule is 
represented by the "$\theta \ge \theta_M$" condition on Equation 2 above.) 

Figure 5 shows marginal classification error rates as a function of test length. 
Separate plots are shown for false positive, false negative, and total (sum of false positive 
and false negative) error rates. As expected, the 16-item test has lower error rates of 
each type than the 30-item test. Also, the within-cutscore trend noted in Figure 4 above 
is reflected by within-cutscore trends in false positive and false negative error rates. For 
a fixed cutoff score, the false negative error rate decreases and the false positive error rate 
increases with test length. 

The following table summarizes the possibilities for shortening the test in any 
application that requires only one at-or-above classification. For each type of 
classification, a shortened test is identified by the number of items it contains and its 
marginal error rate (false positive plus false negative marginal error rates). No other 
tests for the same type of classification had a lower error rate or contained fewer items. 
It is seen that the test could be shortened by about half, on average, if one is interested 
only in making a pass/fail classification with regard to one level of skill. 

Table 1: Classification Error Rates for Shortened vs. Full Length Test

                                   Number of Items      Total Error Rate
Classification   Critical Theta    in Shortened Test    Shortened Test    Full Length Test (30 items)
≥ Level 1            -1.44               12                 .095               .123
≥ Level 2            -0.43               21                 .099               .117
≥ Level 3              .37               16                 .102               .108
≥ Level 4             1.49               12                 .087               .088
≥ Level 5             2.41                4                 .142               .150


Table 1 is not meant to suggest that predicted error rates should be the only guide 
for constructing a test or choosing a test length. But test information may be an 
insufficient basis for constructing an optimal test, particularly when number-correct 
scoring is used. For example, compared to the 15-item test, both the 12-item test and the 
16-item test had less information at the Level 1 critical theta (See Figure 1), but had 
lower marginal error rates (See Figure 5). The 16-item test had the same marginal error 
rate as the 12-item test (.095). 

Educational Importance 

This research shows that many tests designed to yield pass/fail results, such as 
licensure and certification exams, could be shortened without negatively impacting 
classification error rates. Under some circumstances, shortening a paper-and-pencil test 
could be a reasonable alternative to computerized testing. This research also has 
implications for the administration of fixed forms, or pre-assembled testlets, by computer, 
if pass/fail or stop/continue testing decisions are based on number correct scoring. 

Figure 1: Test and Item Information at Level 1 Critical Theta as

Figure 2: Test Information Functions for Number Correct Score

Figure 3: Probability of Passing Level 1

Figure 4: Theta Conditional on Cutscore

Figure 5: At-or-above Level 1 Classification Error Rates